divya1974
Contributor

…ts added)

Summary
Fix incorrect Series.isin results when comparing signed int64 values against uint64 values that are not equal. Previously, mixing signed and unsigned 64-bit integers could trigger coercion to a common numeric dtype of float64, which loses precision for large values and can produce false positives. This change prevents that unsafe upcast by preferring an object-based comparison when signed and unsigned integer types are mixed.

Root cause
When isin attempted to find a common numeric dtype between comps (left side) and values (right side), mixing signed int64 with uint64 could lead to casting both sides to float64. float64 has a 53-bit significand, so not every integer above 2**53 can be represented exactly. Converting large 64-bit integers to float64 therefore loses precision and can make two distinct integers compare as equal.
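The precision loss described above can be reproduced with NumPy alone. The following is a minimal sketch, not the pandas code path: the helper name `common_dtype_loses_precision` and the sample values are assumptions chosen for illustration.

```python
import numpy as np


def common_dtype_loses_precision(signed_value, unsigned_value):
    """Compare one int64 and one uint64 value exactly and via their common dtype.

    Hypothetical helper for illustration only; not part of pandas.
    """
    left = np.array([signed_value], dtype=np.int64)
    right = np.array([unsigned_value], dtype=np.uint64)

    # NumPy promotes int64 + uint64 to float64 because no 64-bit integer
    # dtype can hold the full range of both.
    common = np.result_type(left.dtype, right.dtype)

    # Comparison after casting both sides to the common dtype.
    equal_after_cast = bool(left.astype(common)[0] == right.astype(common)[0])

    # Exact comparison using Python integers (arbitrary precision).
    equal_exact = int(left[0]) == int(right[0])
    return common, equal_after_cast, equal_exact


# 2**53 + 1 is not representable in float64 and rounds to 2**53.
common, equal_after_cast, equal_exact = common_dtype_loses_precision(2**53 + 1, 2**53)
print(common, equal_after_cast, equal_exact)
```

Here the two integers differ, yet they compare as equal once both are cast to the common dtype, which is the kind of false positive this PR targets.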

What I changed
algorithms.py:
Adjusted the condition used before converting values to an object array so that when dtypes differ and either side is an unsigned integer, values is converted to object (i.e., Python-level equality / hashtable lookup) instead of numeric coercion. This makes the mixed signed/unsigned decision symmetric and avoids unsafe float upcasts.
test_isin.py:
Added test_isin_int64_vs_uint64_mismatch, which reproduces the reported case and asserts the correct False result.
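To illustrate why an object-based comparison is exact, here is a rough sketch that performs the membership check on Python integers. It is an assumption-laden stand-in, not the hashtable code in algorithms.py, and `isin_via_object` is a hypothetical name.

```python
import numpy as np


def isin_via_object(comps, values):
    """Membership test on Python ints, mimicking object-dtype equality.

    Hypothetical illustration; pandas uses its own hashtable internally.
    """
    # tolist() turns NumPy integer scalars into arbitrary-precision Python
    # ints, so no float64 cast happens anywhere in the comparison.
    value_set = set(np.asarray(values).tolist())
    return np.array([c in value_set for c in np.asarray(comps).tolist()], dtype=bool)


comps = np.array([2**53 + 1, 5], dtype=np.int64)
values = np.array([2**53, 5], dtype=np.uint64)
print(isin_via_object(comps, values))
```

Only the value 5 is reported as present; 2**53 + 1 and 2**53 stay distinct because they are never cast to float64.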

Comment on lines 525 to 527
# If the dtypes differ and either side is unsigned integer,
# prefer object dtype to avoid unsafe upcast to float64 that
# can lose precision for large 64-bit integers.
Member

Does this change the performance when values and comps are both integer-like and fit within an integer type?

Contributor Author

The fix takes the conservative approach of converting values to object when signed and unsigned integer dtypes are mixed, to ensure correctness. This preserves exact integer equality but may be slower for very large arrays than a numeric-only path.
The trade-off favors correctness over performance in the rare case of very large arrays with mixed signed/unsigned ints.

Contributor Author

Another option, for better performance, is to remove the earlier asymmetric object-conversion block and add a fast, safe numeric path that handles signed/unsigned mixes correctly without converting to object.

The idea is to use masked uint64 lookups, which avoid float casts and preserve performance. I would place this fast path after the comps_array extraction and before the common-type coercion: map the int64 and uint64 values into a shared uint64 space, masking out entries that cannot be represented there, and perform the hashtable lookups on uint64.

This would involve changes roughly around lines algorithms.py+6-16. I would also run the new tests afterward to verify behavior. I am not sure whether this breaks anything. Should I try?
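One possible reading of the masked uint64 idea is sketched below using plain NumPy. This is only an assumption about how such a fast path might look: `isin_mixed_int64_uint64` is a hypothetical helper, and `np.isin` stands in for the pandas hashtable lookup.

```python
import numpy as np


def isin_mixed_int64_uint64(comps, values):
    """Exact isin for int64-vs-uint64 inputs without any float64 cast.

    Hypothetical sketch; np.isin stands in for a uint64 hashtable lookup.
    """
    comps = np.asarray(comps)
    values = np.asarray(values)

    if comps.dtype == np.int64 and values.dtype == np.uint64:
        # Negative comps can never equal an unsigned value, so they stay
        # False; only the non-negative ones are looked up as uint64.
        result = np.zeros(comps.shape, dtype=bool)
        nonneg = comps >= 0
        result[nonneg] = np.isin(comps[nonneg].astype(np.uint64), values)
        return result

    if comps.dtype == np.uint64 and values.dtype == np.int64:
        # Negative values can never match an unsigned comp, so drop them
        # before casting the remainder to uint64.
        nonneg_values = values[values >= 0].astype(np.uint64)
        return np.isin(comps, nonneg_values)

    raise TypeError("expected one int64 array and one uint64 array")


comps = np.array([-1, 5, 2**53 + 1], dtype=np.int64)
values = np.array([2**53, 5, 2**64 - 1], dtype=np.uint64)
print(isin_mixed_int64_uint64(comps, values))
```

The mask matters here: casting -1 directly to uint64 would wrap around to 2**64 - 1 and produce a false match, so negative entries are excluded before the cast.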

Development

Successfully merging this pull request may close these issues.

BUG: Implicit conversion to float64 with isin()

2 participants